Developer CD Series 1996 May: Tool Chest

Text File | 1995-03-15 | 17.8 KB | 405 lines | [TEXT/MMCC]

The String-extensions Library

Copyright (c) 1994 Carnegie Mellon University

Introduction

String-extensions is a library of routines for working with characters and strings. String-extensions exports these modules:

Conversions
    This module consists of various useful conversions involving strings.

Character-type
    This module is a Dylanized version of the C library ctype.h.

String-hacking
    This module exports miscellaneous functions and data structures that are useful when working with strings and characters.

Regular-expressions
    This module contains various functions that deal with regular expressions (regexps).

Substring-search
    This module contains methods for searching for fixed substrings rather than general regular expressions.

The Conversions Module

The Conversions module consists of various useful conversions involving strings. They are:

    string-to-integer(string, #key base) => integer        [Function]
    integer-to-string(integer, #key base) => string        [Function]
    digit-to-integer(character) => integer                 [Function]
    integer-to-digit(integer) => character                 [Function]

Base defaults to 10, and is the radix of the number system to convert from or to. Bases below 2 are errors, as are bases above 36. When converting from a string, the string must exactly describe a number, with no excess characters. Digit-to-integer will signal an error if the digit is non-alphanumeric. Errors will be signalled for all invalid input.

    as(<string>, character)                                [G.F. Method]

Turns a character into the appropriate string of length one.
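The conversions described above can be sketched as follows. This is only an illustration: the method name conversion-examples is made up, and the code assumes the Conversions module has already been imported into the current module (the library and module definitions are left out). The values in the comments follow from the descriptions above rather than from a recorded run.

```dylan
// Hypothetical helper, not part of String-extensions.
// Assumes the Conversions module is already in scope.
define method conversion-examples ()
  // base: defaults to 10.
  let decimal = string-to-integer("42");              // 42
  // base: is the radix to convert from.
  let binary = string-to-integer("101", base: 2);     // 5
  // integer-to-string goes the other way.
  let text = integer-to-string(5, base: 2);           // "101"
  // Single-character conversions.
  let seven = digit-to-integer('7');                  // 7
  let digit = integer-to-digit(7);                    // '7'
  // The as method yields a string of length one.
  let one-char = as(<string>, 'x');                   // "x"
  list(decimal, binary, text, seven, digit, one-char)
end method conversion-examples;
```

Since the string must exactly describe a number, something like string-to-integer("42 apples") should signal an error rather than return 42.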

The Character-type Module

Character-type is a Dylanized version of the C library ctype.h. It contains the following functions:

    FUNCTION AND ARG TYPE       RETURNS #t FOR THESE CHARACTERS
    alpha?(character)           a-zA-Z
    digit?(character)           0-9
    alphanumeric?(character)    a-zA-Z0-9
    whitespace?(character)      Space, tab, newline, formfeed, carriage return
    uppercase?(character)       A-Z
    lowercase?(character)       a-z
    hex-digit?(character)       0-9a-f
    punctuation?(character)     ,./<>?;\:"|'[]{}!@#$%^&*()-=_+`~
    graphic?(character)         alphanumeric or punctuation
    printable?(character)       graphic or whitespace
    control?(character)         not printable

String-hacking

The String-hacking module exports miscellaneous functions and data structures that are useful when working with strings and characters.

    add-last(stretchy-sequence, object)                    [Generic Function]
        => stretchy-sequence
    add-last(string, character) => string                  [G.F. Method]

Like add, except it's guaranteed to add the character to the end of the string.

    predecessor(character) => character                    [Function]

Get the character before this character. Equivalent to

    as(<character>, -1 + as(<integer>, character))

    successor(character) => character                      [Function]

Get the character after this character. Equivalent to

    as(<character>, 1 + as(<integer>, character))

    case-insensitive-equal(object1, object2)               [Generic Function]
    case-insensitive-equal(string1, string2)               [G.F. Method]
    case-insensitive-equal(character1, character2)         [G.F. Method]

Does a case insensitive equality test. Methods are provided only for strings and characters, not general collections.

    <character-set>                                        [Sealed Abstract Class]
    <case-sensitive-character-set>                         [Class]
    <case-insensitive-character-set>                       [Class]

A <character-set> is a non-mutable subclass of <collection>, and is conceptually an unordered set of characters. Dylan collection elements always have keys, so to fit sets into Dylan, the key of an element of a character set is the element itself.

There are two instantiable subclasses of <character-set>: <case-sensitive-character-set> and <case-insensitive-character-set>. <character-set> itself is not instantiable; one must always specify one of the instantiable subclasses when creating a character set.

There are two ways of making a character set. The first is a method for make using the description: keyword. The value that follows the description: keyword is a string that describes the set using a notation like a regular expression character set, except without the '[' and ']' delimiters. For example,

    make(<case-sensitive-character-set>, description: "a-z")

would be the set of all lowercase alphabetic characters.

A second way to create character sets is to use an "as" method. The as method takes a collection of characters and discards the keys of these characters. Example:

    as(<case-insensitive-character-set>, "abcdefghijklmnopqrstuvwxyz")

is again the set of all lowercase alphabetic characters. It is important to realize that the as method does *not* take a description:

    as(<case-sensitive-character-set>, "a-z")

returns the set of 'a', '-', and 'z', not the set of all alphabetic characters.

The most useful operation on character sets is member?, which does what one would expect. Another useful operation is the forward-iteration-protocol. This calls member? on every possible character until it finds a character that is a member of the set. This means that in a <case-insensitive-character-set>, both 'a' and 'A' will come up.

    <byte-character-table>                                 [Class]

A byte-character-table is a vector that uses byte characters as indices instead of integers. The following are equivalent:

    regular-vector[as(<integer>, character)]
    byte-character-table[character]

<byte-character-table> has absolutely no relation to <table>. It is simply a <mutable-explicit-key-collection>.

Regular-expressions

The Regular-expressions module contains various functions that deal with regular expressions (regexps).

The module is based on Perl (version 4), and has the same semantics unless otherwise noted. The syntax for Perl-style regular expressions can be found on page 103 of Programming Perl by Larry Wall and Randal L. Schwartz.

There are some differences in the way String-extensions handles regular expressions. The biggest difference is that regular expressions in Dylan are case insensitive by default. Also, when given an invalid regexp, String-extensions will produce undefined behavior while Perl would give an error message.

There is some work involved in analyzing a regular expression, and if the same regexp is used repeatedly with different target strings, that work is repeated each time. For this reason, each basic function in the Regular-expressions module comes with a companion function that makes using a regular expression more efficient when it is used more than once. For example, the regexp-replace function has the make-regexp-replacer companion function. There is one exception: the join function has no make-joiner function.

The "make-fooer" will analyze the regular expression exactly once, and provide a function that makes use of this pre-analyzed regular expression. For example, the following two pieces of code yield the same result:

    regexp-position("This is a string", "is");

    let is-finder = make-regexp-positioner("is");
    is-finder("This is a string");

However, the second form is more efficient if is-finder is called multiple times.

    regexp-position                                        [Generic Function]
        (big-string, regexp, #key start, end, case-sensitive)
        => variable-number-of-marks-or-#f

This function returns the index of the start of the regular expression in the big-string, or #f if the regular expression is not found. As a second value, it returns the index of the end of the regular expression in the big-string (assuming it was found; otherwise there is no second value).
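Because regexp-position hands back its marks as multiple values, a caller would normally bind them with Dylan's multiple-value let. The following is a minimal sketch, not code from the library: the method name find-is is made up, and it assumes the Regular-expressions module is already in scope. Copy-sequence is the standard Dylan function.

```dylan
// Hypothetical helper, not part of String-extensions.
// Assumes the Regular-expressions module is already in scope.
define method find-is (big-string :: <string>)
  // Bind the first two return values: start and end of the match.
  // If fewer values come back, the extra variables are bound to #f.
  let (match-start, match-end) = regexp-position(big-string, "is");
  if (match-start)
    // Found: copy out the text between the two marks.
    copy-sequence(big-string, start: match-start, end: match-end)
  else
    // Not found: the first value was #f.
    #f
  end if
end method find-is;
```

In "This is a string" the first "is" sits between indices 2 and 4, so with those marks the method returns "is". Note that end cannot be used as a variable name in Dylan, which is why the sketch uses match-end.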

If there are groups in the regular expression, regexp-position will return two more values (a start and an end) for each group. If the subgroup is matched, these will be integers. If the subgroup is not matched, however, both the start and the end will be #f. So

    regexp-position("This is a string", "is");

returns values(2, 4), and

    regexp-position("This is a string", "(is)(.*)ing");

returns values(2, 16, 2, 4, 4, 13), while

    regexp-position("This is a string", "(not found)(.*)ing");

returns #f.

The marks are always given relative to the start of big-string, and not relative to the start: keyword.

Start: and end: specify what part of big-string to look at, and they default to the beginning and end of the string, respectively. Case-sensitive defaults to false.

    make-regexp-positioner                                 [Generic Function]
        (regexp, #key byte-characters-only, need-marks,
         maximum-compile, case-sensitive)
        => an anonymous positioner function
           method (big-string, #key start, end)

Make-regexp-positioner can return several different types of positioners, and it is up to the user to specify which kind is wanted. By default, it returns a positioner that works like regexp-position. However, if need-marks is #f, it may give a positioner that only returns #t or #f, with no marks. (And then again, it may still return marks.) If byte-characters-only is specified, the positioner may only work on big-strings that consist only of byte characters (characters whose numerical value is between 0 and 255, inclusive). And if maximum-compile is #t, it will take a long time to return a positioner, but the positioner will run really fast.

    regexp-replace                                         [Generic Function]
        (big-string, search-for-regexp, replace-with-string,
         #key count, case-sensitive, start, end)
        => new-string

This replaces all occurrences of regexp in big-string with replace-string. If count: is specified, it replaces only the first count occurrences of regexp.
(This is different from Perl, which replaces only the first occurrence unless /g is specified.)

Replace-string can contain backreferences to the regexp. For instance,

    regexp-replace("The rain in spain and some other text",
                   "the (.*) in (\\w*\\b)", "\\2 has its \\1")

returns "spain has its rain and some other text". If the subgroup referred to by the backreference was not matched, the reference is interpreted as the null string. For instance,

    regexp-replace("Hi there", "Hi there(, Bert)?",
                   "What do you think\\1?")

returns "What do you think?" because ", Bert" wasn't found.

    make-regexp-replacer                                   [Generic Function]
        (regexp, #key replace-with, case-sensitive)
        => an anonymous replacer function that is either
           method (big-string, #key count, start, end)
           or
           method (big-string, replace-string, #key count, start, end)

The first form is returned if the replace-with: keyword is supplied; otherwise the second form, which takes the replacement string as an argument, is returned. (There is no efficiency gained by supplying the replace-with string, but it might be convenient.)

    translate(big-string, from-string, to-string,          [Generic Function]
              #key delete, start, end)
        => new-string

This is equivalent to Perl's tr/// construct. From-string is a string specification of a character set, and to-string is another character set. Translate converts big-string character by character, according to the sets. For instance,

    translate("any string", "a-z", "A-Z")

will convert "any string" to all uppercase: "ANY STRING".

Like Perl, character ranges are not allowed to be "backwards". The following is not legal:

    translate("any string", "a-z", "z-a")

(This restriction may be removed in future releases.)

Unlike Perl's tr///, translate doesn't return the number of characters translated.

If delete: is true, any characters in the from-string that don't have matching characters in the to-string are deleted.
The following will remove all vowels from a string and convert periods to commas:

    translate("any string", ".aeiou", ",", delete: #t)

Delete: is false by default. If delete: is false and there aren't enough characters in the to-string, the last character in the to-string is reused as many times as necessary. The following converts several punctuation characters into spaces:

    translate("any string", ",./:;[]{}()", " ");

Start: and end: indicate which part of the string to translate. They default to the entire string.

Caveats: Translate is always case sensitive.

    translate                                              [G.F. Method]
        (big-byte-string, from-byte-string, to-byte-string,
         #key delete, start, end)
        => new-string

The only method of translate operates only on byte strings.

    make-translator                                        [Generic Function]
        (from-string, to-string, #key delete)
        => an anonymous translator
           method (big-string, #key start, end) => new-string

Does what you'd expect it to.

    make-translator                                        [G.F. Method]
        (from-byte-string, to-byte-string, #key delete)
        => an anonymous translator
           method (big-string, #key start, end) => new-byte-string

Again, the existing method on make-translator only handles byte strings.

    split                                                  [Generic Function]
        (regexp, big-string, #key count, remove-empty-items,
         case-sensitive, start, end)
        => a variable number of strings

This is like Perl's split function. It searches big-string for occurrences of regexp, and returns substrings that were delimited by that regexp. For instance,

    split("-", "long-dylan-identifier")

returns values("long", "dylan", "identifier"). Note that what matched the regexp is left out.

Remove-empty-items, which defaults to true, magically skips over empty items, so that

    split("-", "long--with--multiple-dashes")

returns values("long", "with", "multiple", "dashes").

Count is the maximum number of strings to return. If there are n strings and count is specified, the first count - 1 strings are returned as usual, and the count'th string is the remainder, unsplit.
So

    split("-", "really-long-dylan-identifier", count: 3)

returns values("really", "long", "dylan-identifier"). If remove-empty-items is true, empty items aren't counted.

Case-sensitive determines whether the regexp for the delimiter should be considered case sensitive or not; it defaults to case insensitive.

Start: and end: indicate what part of the big string should be looked at for delimiters. They default to the entire string. For instance,

    split("-", "really-long-dylan-identifier", start: 8)

returns values("really-long", "dylan", "identifier").

Caveat: Unlike Perl, empty regular expressions are never legal regular expressions, so there is no way to split a string into a #rest sequence-of-characters. Of course, in Dylan this is not a useful thing to do, so this is not really a problem.

    make-splitter                                          [Generic Function]
        (pattern :: <string>, #key case-sensitive)
        => an anonymous splitter
           method (big-string, #key count, remove-empty-items, start, end)
             => buncha-strings

Does what you would expect.

    join                                                   [Generic Function]
        (delimiter :: <string>, #rest strings)
        => big-string

Does the opposite of a split.

    join(":", word1, word2, word3)

is equivalent to

    concatenate(word1, ":", word2, ":", word3)

(and no more efficient). Note that there is no make-joiner.

Substring-search

Substring-search contains methods for searching for fixed substrings rather than general regular expressions. It is as similar to the Regular-expressions module as we could make it. Substring functions work only on byte strings, and are always case sensitive.

These functions were taken from the Collection-extensions library shipped in Mindy 1.1, but the parameters, keywords, and return values have changed significantly since then.

    substring-position                                     [Generic Function]
        (big-string, search-for-string, #key start, end)
        => position-or-false

Returns the position of the search-for-string in the big-string (or that portion of the big-string specified by start: and end:). This search is always case sensitive.
This function uses the Boyer-Moore algorithm for long strings, and a simple dumb search for short strings. It should yield good performance under all circumstances.

    make-substring-positioner (search-for-string)          [Generic Function]
        => an anonymous positioner
           method (big-string, #key start, end) => position-or-false

Does the obvious.

    substring-replace                                      [Generic Function]
        (big-string, search-for-string, replace-with-string,
         #key count, start, end)
        => replaced-string

Replaces the substring, or the first count instances of it if count: is specified. Note this function does not support start: or end:.

    make-substring-replacer                                [Generic Function]
        (search-for :: <byte-string>, #key replace-with)
        => an anonymous replacer function that is either
           method (big-string, #key count, start, end) => new-string
           or
           method (big-string, replace-with-string, #key count, start, end)

Does the obvious.

Known bugs

Regular-expressions will do unpredictable things if given bad arguments (i.e., a string that isn't a legal regular expression). Sometimes it'll crash, and sometimes it'll merrily chug away and return crazy answers.

The regexp parser will happily accept a "quantified assertion," which isn't technically a legal regexp. However, both regular and compiled matching will handle it as one intuitively thinks it should be handled. (An example of a quantified assertion would be "^*", which matches "any number of beginning of line". Since "*" means "0 or more", "^*" is interpreted to mean "", which is how one would intuitively believe it should be interpreted.)